1 Introduction
• The context and background: course, company name, business context.
During the first year of our Master's in Management (Business Analytics orientation), we attended the course Machine Learning for Business Analytics. The course covered machine learning techniques for business contexts, mainly supervised methods (regression, trees, support vector machines, neural networks) and unsupervised methods (clustering, PCA, FAMD, auto-encoders), along with other topics such as data splitting, ensemble methods and metrics.
• Aim of the investigation: major terms should be defined, the question of research (more generally the issue), why it is of interest and relevant in that context.
As part of this course, our group worked on an applied project. Starting from scratch, we had to find a dataset on which to apply what we learned in class to a real case. We found a dataset describing vehicle MPG, range, engine statistics and more, for more than 100 brands. The goal of our research is to predict the make (i.e. the brand) of a car from its characteristics (consumption, range, fuel type, …) using a trained model (random forest, artificial neural network or decision tree). Since cars can share several characteristics while differing on others, we thought it pertinent to build a model able to predict a car's brand from its features.
• Description of the data and the general material provided and how it was made available (and/or collected, if it is relevant). Only in broad terms however, the data will be further described in a following section. Typically, the origin/source of the data (the company, webpage, etc.), the type of files (Excel files, etc.), and what it contains in broad terms (e.g. “a file containing weekly sales with the factors of interest including in particular the promotion characteristics”).
The CSV dataset was found on data.world, a data catalog platform that gathers various open-access datasets. The file contains more than 45,000 rows and 26 columns, each column describing one feature (such as the model year, the model, the estimated annual petroleum consumption in barrels, the highway MPG per fuel type and so on).
• The method that is used, in broad terms, no details needed at this point. E.g. “Model based machine learning will help us quantifying the important factors on the sales”.
From these columns, we need a machine learning model that helps us quantify the importance of each feature in predicting the make of the car. Various supervised and unsupervised models will be tried.
• An outlook: a short paragraph indicating from now what will be treated in each following sections/chapters. E.g. “in Section 3, we describe the data. Section 4 is dedicated to the presentation of the text mining methods…”
The remainder of this report is organised as follows. Section 2 describes the data in more depth: the variables and features, the instances, the type of data and any missing data patterns. The next section covers the Exploratory Data Analysis (EDA), with visualizations to better perceive patterns in the variables as well as potential correlations. Section 4 presents the methods, split into supervised and unsupervised, used to find a suitable model for our project. The results are discussed right after, followed by a conclusion with recommendations and discussion. Finally, the references and appendix appear at the end of the report.
2 Data description
- Description of the data file format (xlsx, csv, text, video, etc.) DONE
- The features or variables: type, units, the range (e.g. the time, numerical, in weeks from January 1, 2012 to December 31, 2015), their coding (numerical, the levels for categorical, etc.), etc. TABLE-NTBF
- The instances: customers, company, products, subjects, etc. DONE
- Missing data pattern: if there are missing data, if they are specific to some features, etc. NTBD
- Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc. NTBD
- If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications. NTBD
For this project, we selected a dataset on vehicle characteristics, available as a .csv file from data.world. It includes a total of 26 features describing 45,896 vehicle models released between 1984 and 2023. The table below gives an overview of the available features and their descriptions. A more detailed description of the data is given in the Annex.
2.0.1 The features or variables: type, units,…
| Variable Name | Explanation |
|---|---|
| ID | Number corresponding to the precise combination of the features of the model |
| Model Year | Year of the model of the car |
| Make | The brand of the car |
| Model | The model of the car |
| Estimated Annual Petroleum Consumption (Barrels) | Consumption in Petroleum Barrels |
| Fuel Type 1 | First fuel energy source; the only source if the car is not a hybrid |
| City MPG (Fuel Type 1) | |
| Highway MPG (Fuel Type 1) | |
| Combined MPG (Fuel Type 1) | |
| Fuel Type 2 | Second energy source if hybrid car |
| City MPG (Fuel Type 2) | |
| Highway MPG (Fuel Type 2) | |
| Combined MPG (Fuel Type 2) | |
| Engine Cylinders | From 2 to 16 cylinders |
| Engine Displacement | Measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers |
| Drive | Drivetrain type (e.g. front-wheel, rear-wheel or all-wheel drive) |
| Engine Description | Description of the car's engine, e.g. Turbo, Stop-Start, ... |
| Transmission | Manual/Automatic transmission, with number of gears and/or model of transmission |
| Vehicle Class | e.g. Minivan, Trucks, Midsize, ... |
| Time to Charge EV (hours at 120v) | |
| Time to Charge EV (hours at 240v) | |
| Range (for EV) | |
| City Range (for EV - Fuel Type 1) | |
| City Range (for EV - Fuel Type 2) | |
| Hwy Range (for EV - Fuel Type 1) | |
| Hwy Range (for EV - Fuel Type 2) | |
2.1 The instances: customers, company, products, subjects, etc.
Each row (instance) describes one car. The first column is the ID, which identifies a precise combination of feature values, followed by the features listed in the table above.
2.2 Missing data pattern: if there are missing data, if they are specific to some features, etc.
2.3 Any modification to the initial data: aggregation, imputation in replacement of missing data, recoding of levels, etc.
2.4 If only a subset was used, it should be mentioned and explained; e.g. inclusion criteria. Note that if inclusion criteria do not exist and the inclusion was an arbitrary choice, it should be stated as such. One should not try to invent unreal justifications.
EDA:
Columns description
To begin with our EDA, let’s have a look at our dataset and in particular the characteristics of the columns.
#to get a detailed summary
skim(data)

| Name | data |
|---|---|
| Number of rows | 45896 |
| Number of columns | 26 |
| Column type frequency: | |
| character | 8 |
| numeric | 18 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Make | 0 | 1.00 | 3 | 34 | 0 | 141 | 0 |
| Model | 0 | 1.00 | 1 | 47 | 0 | 4762 | 0 |
| Fuel Type 1 | 0 | 1.00 | 6 | 17 | 0 | 6 | 0 |
| Fuel Type 2 | 44059 | 0.04 | 3 | 11 | 0 | 4 | 0 |
| Drive | 1186 | 0.97 | 13 | 26 | 0 | 7 | 0 |
| Engine Description | 17031 | 0.63 | 1 | 46 | 0 | 589 | 0 |
| Transmission | 11 | 1.00 | 12 | 32 | 0 | 40 | 0 |
| Vehicle Class | 0 | 1.00 | 4 | 34 | 0 | 34 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 23102.11 | 13403.10 | 1.00 | 11474.75 | 23090.50 | 34751.25 | 46332.00 | ▇▇▇▇▇ |
| Model Year | 0 | 1.00 | 2003.61 | 12.19 | 1984.00 | 1992.00 | 2005.00 | 2015.00 | 2023.00 | ▇▆▆▇▇ |
| Estimated Annual Petrolum Consumption (Barrels) | 0 | 1.00 | 15.33 | 4.34 | 0.05 | 12.94 | 14.88 | 17.50 | 42.50 | ▁▇▃▁▁ |
| City MPG (Fuel Type 1) | 0 | 1.00 | 19.11 | 10.31 | 6.00 | 15.00 | 17.00 | 21.00 | 150.00 | ▇▁▁▁▁ |
| Highway MPG (Fuel Type 1) | 0 | 1.00 | 25.16 | 9.40 | 9.00 | 20.00 | 24.00 | 28.00 | 140.00 | ▇▁▁▁▁ |
| Combined MPG (Fuel Type 1) | 0 | 1.00 | 21.33 | 9.78 | 7.00 | 17.00 | 20.00 | 23.00 | 142.00 | ▇▁▁▁▁ |
| City MPG (Fuel Type 2) | 0 | 1.00 | 0.85 | 6.47 | 0.00 | 0.00 | 0.00 | 0.00 | 145.00 | ▇▁▁▁▁ |
| Highway MPG (Fuel Type 2) | 0 | 1.00 | 1.00 | 6.55 | 0.00 | 0.00 | 0.00 | 0.00 | 121.00 | ▇▁▁▁▁ |
| Combined MPG (Fuel Type 2) | 0 | 1.00 | 0.90 | 6.43 | 0.00 | 0.00 | 0.00 | 0.00 | 133.00 | ▇▁▁▁▁ |
| Engine Cylinders | 487 | 0.99 | 5.71 | 1.77 | 2.00 | 4.00 | 6.00 | 6.00 | 16.00 | ▇▇▅▁▁ |
| Engine Displacement | 485 | 0.99 | 3.28 | 1.36 | 0.00 | 2.20 | 3.00 | 4.20 | 8.40 | ▁▇▅▂▁ |
| Time to Charge EV (hours at 120v) | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| Time to Charge EV (hours at 240v) | 0 | 1.00 | 0.11 | 1.01 | 0.00 | 0.00 | 0.00 | 0.00 | 15.30 | ▇▁▁▁▁ |
| Range (for EV) | 0 | 1.00 | 2.36 | 24.97 | 0.00 | 0.00 | 0.00 | 0.00 | 520.00 | ▇▁▁▁▁ |
| City Range (for EV - Fuel Type 1) | 0 | 1.00 | 1.62 | 20.89 | 0.00 | 0.00 | 0.00 | 0.00 | 520.80 | ▇▁▁▁▁ |
| City Range (for EV - Fuel Type 2) | 0 | 1.00 | 0.17 | 2.73 | 0.00 | 0.00 | 0.00 | 0.00 | 135.28 | ▇▁▁▁▁ |
| Hwy Range (for EV - Fuel Type 1) | 0 | 1.00 | 1.51 | 19.70 | 0.00 | 0.00 | 0.00 | 0.00 | 520.50 | ▇▁▁▁▁ |
| Hwy Range (for EV - Fuel Type 2) | 0 | 1.00 | 0.16 | 2.46 | 0.00 | 0.00 | 0.00 | 0.00 | 114.76 | ▇▁▁▁▁ |
The dataset contains approximately 46,000 rows and 26 columns. Most of our features relate to the consumption of the cars. In addition, some variables contain many missing values, and the variable “Time.to.Charge.EV..hours.at.120v.” contains only 0s. We will handle these in the section “Data cleaning”.
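The two observations above (columns with missing values and a column containing only 0s) can also be checked programmatically. The following is a minimal base R sketch on a toy data frame; the data frame `toy` and its column names are hypothetical stand-ins for illustration and are not the project's data.

```r
# Toy data frame standing in for the vehicle data (values are illustrative only)
toy <- data.frame(
  city_mpg    = c(18, 22, 30, 95),
  cylinders   = c(6, 4, 4, NA),   # missing for one vehicle
  charge_120v = c(0, 0, 0, 0)     # constant column, like the 120v charging time
)

# Number of missing values per column
n_missing <- colSums(is.na(toy))

# A column with at most one distinct non-missing value carries no information
is_constant <- vapply(
  toy,
  function(x) length(unique(x[!is.na(x)])) <= 1,
  logical(1)
)

print(n_missing)
print(names(toy)[is_constant])
```

Applied to the real data, the same two lines would flag the columns that need attention before modeling, assuming the data frame is loaded under the name used in the report.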
Exploration of the distribution
Here are more details about the distribution of the numerical features.
::: {.cell}
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
# stat_density() +
# facet_wrap(~variable, scales = "free")
plot_histogram(data)
# Time.to.Charge.EV..hours.at.120v. does not appear because all observations are 0
:::
::: {.cell hash=‘report_html_cache/html/unnamed-chunk-6_6fdfa5f4fe30c3a8f031ecf732357900’}
#tentative boxplots
# data_long <- data %>%
# select_if(is.numeric) %>%
# pivot_longer(cols = c("ID",
# "Model.Year",
# "Estimated.Annual.Petrolum.Consumption..Barrels.",
# "City.MPG..Fuel.Type.1.",
# "Highway.MPG..Fuel.Type.1." ,
# "Combined.MPG..Fuel.Type.1.",
# "City.MPG..Fuel.Type.2." ,
# "Highway.MPG..Fuel.Type.2.",
# "Combined.MPG..Fuel.Type.2." ,
# "Time.to.Charge.EV..hours.at.120v." ,
# "Time.to.Charge.EV..hours.at.240v." ,
# "Range..for.EV.",
# "City.Range..for.EV...Fuel.Type.1.",
# "City.Range..for.EV...Fuel.Type.2.",
# "Hwy.Range..for.EV...Fuel.Type.1." ,
# "Hwy.Range..for.EV...Fuel.Type.2." ), names_to = "variable", values_to = "value")
#
# ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
# geom_boxplot() +
# facet_wrap(~ variable, scales = "free_y") + # Each variable gets its own y-axis
# theme_minimal() +
# labs(title = "Boxplots of Variables with Different Scales", x = "", y = "Value")
:::
#Now
# plot_correlation(data) #drop time charge EV 120V
# create_report(data)
#nb cars per brand
Number of models per make
::: {.cell}
#Number of occurrences/model per make
nb_model_per_make <- data %>%
  group_by(Make, Model) %>%
  summarise(Number = n(), .groups = 'drop') %>%
  group_by(Make) %>%
  summarise(Models_Per_Make = n(), .groups = 'drop') %>%
  arrange(desc(Models_Per_Make))

#table
datatable(nb_model_per_make,
          rownames = FALSE,
          class = "hover",  # 'class' is an argument of datatable(), not a DataTables option
          options = list(pageLength = 10,
                         searchHighlight = TRUE))

# Reordering the Make variable within the plotting code to make it ordered by Models_Per_Make descending
nb_model_per_make$Make <- factor(nb_model_per_make$Make, levels = nb_model_per_make$Make[order(-nb_model_per_make$Models_Per_Make)])

# Bar Plot
ggplot(nb_model_per_make, aes(x = Models_Per_Make, y = Make, fill = Make)) +
  geom_bar(stat = "identity", color = "black", show.legend = FALSE) +
  labs(title = "Bar Plot of Models per Make",
       x = "Number of Models",  # axis labels now match the aesthetics above
       y = "Make") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x labels for better visibility
:::
Correlation matrix for numerical features
::: {.cell}
library(corrplot)
library(reshape2)
#select only numerical columns
#note: Time.to.Charge.EV..hours.at.120v. is not dropped here; being constant (all 0), it yields NA correlations
data_corrplot <- data %>%
  select_if(is.numeric)
#correlation transformation for plot
cor_matrix <- cor(data_corrplot, use = "complete.obs")
Warning in cor(data_corrplot, use = "complete.obs"): the standard deviation is zero
kable(cor_matrix)

|  | ID | Model Year | Estimated Annual Petrolum Consumption (Barrels) | City MPG (Fuel Type 1) | Highway MPG (Fuel Type 1) | Combined MPG (Fuel Type 1) | City MPG (Fuel Type 2) | Highway MPG (Fuel Type 2) | Combined MPG (Fuel Type 2) | Engine Cylinders | Engine Displacement | Time to Charge EV (hours at 120v) | Time to Charge EV (hours at 240v) | Range (for EV) | City Range (for EV - Fuel Type 1) | City Range (for EV - Fuel Type 2) | Hwy Range (for EV - Fuel Type 1) | Hwy Range (for EV - Fuel Type 2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1.0000000 | 0.8981720 | -0.2668937 | 0.2413771 | 0.2830843 | 0.2618352 | 0.1346307 | 0.1463792 | 0.1400545 | 0.0336928 | -0.0031998 | NA | 0.1053588 | NA | NA | 0.0876606 | NA | 0.0913331 |
| Model Year | 0.8981720 | 1.0000000 | -0.2768158 | 0.2280128 | 0.2911270 | 0.2563010 | 0.1341789 | 0.1477942 | 0.1403278 | 0.0501838 | 0.0034894 | NA | 0.0994659 | NA | NA | 0.0827633 | NA | 0.0862010 |
| Estimated Annual Petrolum Consumption (Barrels) | -0.2668937 | -0.2768158 | 1.0000000 | -0.8653379 | -0.9035610 | -0.9001868 | -0.1689445 | -0.1539511 | -0.1637478 | 0.7331933 | 0.7837093 | NA | -0.1859223 | NA | NA | -0.1727049 | NA | -0.1766306 |
| City MPG (Fuel Type 1) | 0.2413771 | 0.2280128 | -0.8653379 | 1.0000000 | 0.9207665 | 0.9857637 | 0.1671384 | 0.1420879 | 0.1574358 | -0.6771928 | -0.7115445 | NA | 0.1414919 | NA | NA | 0.1530046 | NA | 0.1510883 |
| Highway MPG (Fuel Type 1) | 0.2830843 | 0.2911270 | -0.9035610 | 0.9207665 | 1.0000000 | 0.9677170 | 0.0890395 | 0.0751494 | 0.0835921 | -0.6468990 | -0.7063142 | NA | 0.0712719 | NA | NA | 0.0801444 | NA | 0.0792459 |
| Combined MPG (Fuel Type 1) | 0.2618352 | 0.2563010 | -0.9001868 | 0.9857637 | 0.9677170 | 1.0000000 | 0.1365411 | 0.1157063 | 0.1284624 | -0.6825224 | -0.7267671 | NA | 0.1153313 | NA | NA | 0.1256720 | NA | 0.1242627 |
| City MPG (Fuel Type 2) | 0.1346307 | 0.1341789 | -0.1689445 | 0.1671384 | 0.0890395 | 0.1365411 | 1.0000000 | 0.9832273 | 0.9970007 | -0.0231218 | -0.0248593 | NA | 0.8300337 | NA | NA | 0.7992789 | NA | 0.8100197 |
| Highway MPG (Fuel Type 2) | 0.1463792 | 0.1477942 | -0.1539511 | 0.1420879 | 0.0751494 | 0.1157063 | 0.9832273 | 1.0000000 | 0.9941925 | -0.0056200 | -0.0059665 | NA | 0.7991680 | NA | NA | 0.7391649 | NA | 0.7559922 |
| Combined MPG (Fuel Type 2) | 0.1400545 | 0.1403278 | -0.1637478 | 0.1574358 | 0.0835921 | 0.1284624 | 0.9970007 | 0.9941925 | 1.0000000 | -0.0161858 | -0.0173876 | NA | 0.8223966 | NA | NA | 0.7776109 | NA | 0.7913547 |
| Engine Cylinders | 0.0336928 | 0.0501838 | 0.7331933 | -0.6771928 | -0.6468990 | -0.6825224 | -0.0231218 | -0.0056200 | -0.0161858 | 1.0000000 | 0.9051909 | NA | -0.0496963 | NA | NA | -0.0577003 | NA | -0.0574926 |
| Engine Displacement | -0.0031998 | 0.0034894 | 0.7837093 | -0.7115445 | -0.7063142 | -0.7267671 | -0.0248593 | -0.0059665 | -0.0173876 | 0.9051909 | 1.0000000 | NA | -0.0602166 | NA | NA | -0.0629302 | NA | -0.0634886 |
| Time to Charge EV (hours at 120v) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA |
| Time to Charge EV (hours at 240v) | 0.1053588 | 0.0994659 | -0.1859223 | 0.1414919 | 0.0712719 | 0.1153313 | 0.8300337 | 0.7991680 | 0.8223966 | -0.0496963 | -0.0602166 | NA | 1.0000000 | NA | NA | 0.8812479 | NA | 0.9089379 |
| Range (for EV) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA |
| City Range (for EV - Fuel Type 1) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA |
| City Range (for EV - Fuel Type 2) | 0.0876606 | 0.0827633 | -0.1727049 | 0.1530046 | 0.0801444 | 0.1256720 | 0.7992789 | 0.7391649 | 0.7776109 | -0.0577003 | -0.0629302 | NA | 0.8812479 | NA | NA | 1.0000000 | NA | 0.9960186 |
| Hwy Range (for EV - Fuel Type 1) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA |
| Hwy Range (for EV - Fuel Type 2) | 0.0913331 | 0.0862010 | -0.1766306 | 0.1510883 | 0.0792459 | 0.1242627 | 0.8100197 | 0.7559922 | 0.7913547 | -0.0574926 | -0.0634886 | NA | 0.9089379 | NA | NA | 0.9960186 | NA | 1.0000000 |
cor_melted <- melt(cor_matrix)
#plot
ggplot(data = cor_melted, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1),
        axis.text.y = element_text(size = 8)) +
  coord_fixed() +
  labs(x = '', y = '', title = 'Correlation Matrix Heatmap')
:::
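The NA entries in the correlation matrix come from columns whose standard deviation is zero, as the warning from `cor()` indicates. One possible explanation, offered here as an assumption rather than a verified fact about this dataset, is that `use = "complete.obs"` first drops every row with a missing value; if the rows with missing engine values are the only ones with non-zero EV ranges, those range columns become constant among the remaining rows. The base R sketch below reproduces this effect on a hypothetical toy data frame and shows two ways around it.

```r
# Toy data: 'ev_range' is non-zero only on the row where 'cylinders' is missing
toy <- data.frame(
  mpg       = c(20, 25, 31, 110),
  cylinders = c(8, 6, 4, NA),
  ev_range  = c(0, 0, 0, 250)
)

# Complete-case correlation drops row 4, so 'ev_range' becomes constant (all 0)
cor_complete <- suppressWarnings(cor(toy, use = "complete.obs"))

# Pairwise deletion keeps row 4 wherever both variables are observed
cor_pairwise <- suppressWarnings(cor(toy, use = "pairwise.complete.obs"))

# Alternative: among complete rows, drop zero-variance columns before correlating
complete_rows <- toy[complete.cases(toy), ]
keep <- vapply(complete_rows, function(x) sd(x) > 0, logical(1))
cor_clean <- cor(complete_rows[, keep, drop = FALSE])

print(cor_complete)
print(cor_pairwise)
print(cor_clean)
```

Which option is preferable depends on the analysis: pairwise deletion uses more data but computes each coefficient on a different subset of rows, whereas dropping zero-variance columns keeps a single consistent subset.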
3 Data cleaning
In this section, we handle the missing values of our dataset to make sure that we have a clean dataset for our EDA and modeling. We first visualize the missing values, then clean them in the columns that we will use for our analysis. We also remove some rows and columns that are not relevant for our analysis.
Let’s have a look at the entire dataset and its missing values, shown in grey.
Overall, we do not have many missing values relative to the size of our dataset. However, some columns have a lot of missing values. Let’s look in more detail at the columns and rows with missing values.
The missing values are now easier to see. Below is the percentage of missing values by column.
Let’s first have a closer look at the engine cylinders and engine displacement columns.
We see that for all {r} miss_elec rows with missing values in “Engine Cylinders” and “Engine Displacement”, the fuel type is “{r} fuel_type_1_miss”. We can therefore conclude that the missing values in these two columns correspond to our electric vehicles. This makes sense, since electric vehicles do not have a combustion engine, so these features are not applicable to them. We will therefore replace all missing values in these two columns with “none”.
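The check described above can be sketched in base R as follows. The toy data frame, its column names and the fuel type label "Electricity" are assumptions made for illustration; the values actually found in the data are the ones reported in the paragraph above.

```r
# Toy data standing in for the vehicle data (labels and values are hypothetical)
toy <- data.frame(
  fuel_type_1  = c("Regular Gasoline", "Electricity", "Diesel", "Electricity"),
  cylinders    = c(4, NA, 6, NA),
  displacement = c(2.0, NA, 3.0, NA),
  stringsAsFactors = FALSE
)

# Rows where both engine features are missing
missing_engine <- is.na(toy$cylinders) & is.na(toy$displacement)

# Fuel types observed among those rows
fuel_of_missing <- table(toy$fuel_type_1[missing_engine])
print(fuel_of_missing)

# TRUE if every such row is an electric vehicle
all_electric <- all(toy$fuel_type_1[missing_engine] == "Electricity")

# Recoding NA as "none" converts the numeric column to character
cyl_recoded <- ifelse(is.na(toy$cylinders), "none", as.character(toy$cylinders))
print(cyl_recoded)
```

Note that after the recoding the column is no longer numeric, so it would have to be treated as a categorical feature in later modeling steps.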
# Create a summary dataframe of missing values by column
missing_summary_df2 <- data_cleaning %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
  mutate(
    Total_Rows = nrow(data_cleaning),  # row count of the data frame being summarised
    Proportion_Missing = Missing_Count / Total_Rows
  ) %>%
  arrange(desc(Proportion_Missing)) %>%
  select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)
# Print the summary dataframe
datatable(missing_summary_df2,
          class = "hover",  # 'class' is an argument of datatable(), not a DataTables option
          options = list(pageLength = 6,
                         searchHighlight = TRUE),
          rownames = FALSE) %>%
  formatPercentage("Prop. Missing", 2)
# Count the missing 'Drive' values per brand
missing_drive_by_make <- data_cleaning %>%
filter(is.na(Drive)) %>%
count(Make)
# Get total counts per brand in the entire dataset
total_counts_by_make <- data_cleaning %>%
count(Make)
# Calculate the percentage of missing 'Drive' values per brand
percentage_missing_drive_by_make <- missing_drive_by_make %>%
left_join(total_counts_by_make, by = "Make", suffix = c(".missing", ".total")) %>%
mutate(PercentageMissing = (n.missing / n.total)) %>%
arrange(desc(PercentageMissing))
# Print the summary dataframe
datatable(percentage_missing_drive_by_make,
options = list(pageLength = 6,
class = "hover",
searchHighlight = TRUE),
rownames = FALSE)%>%
formatPercentage("PercentageMissing", 2)
# Calculate the percentage of missing 'Drive' values per brand
brand_summary <- data_cleaning %>%
group_by(Make) %>%
summarise(Total = n(),
Missing = sum(is.na(Drive)),
PercentageMissing = (Missing / Total))
# Identify brands with more than 10% missing 'Drive' data
brands_to_remove <- brand_summary %>%
filter(PercentageMissing > brand_missing_threshold) %>%
pull(Make)
# Filter out these brands from the dataset
data_filtered <- data_cleaning %>%
filter(!(Make %in% brands_to_remove))
# For the remaining data, drop rows with missing 'Drive' values
data_cleaning2 <- data_filtered %>%
filter(!is.na(Drive))
# Create a summary dataframe of missing values by column
missing_summary_df3 <- data_cleaning2 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
mutate(
Total_Rows = nrow(data_cleaning2),
Proportion_Missing = Missing_Count / Total_Rows
) %>%
arrange(desc(Proportion_Missing)) %>%
select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)
# Print the summary dataframe
datatable(missing_summary_df3,
options = list(pageLength = 6,
class = "hover",
searchHighlight = TRUE),
rownames = FALSE)%>%
formatPercentage("Prop. Missing", 2)
# Remove rows where the 'Transmission' column has missing values
data_cleaning3 <- data_cleaning2 %>%
filter(!is.na(Transmission))
data_cleaning4 <- data_cleaning3 %>%
mutate(Fuel.Type.2 = replace_na(Fuel.Type.2, "none"))
# Create a summary dataframe of missing values by column
missing_summary_df3 <- data_cleaning3 %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Column", values_to = "Missing_Count") %>%
mutate(
Total_Rows = nrow(data_cleaning3),
Proportion_Missing = Missing_Count / Total_Rows
) %>%
arrange(desc(Proportion_Missing)) %>%
select(Column, "Missing values" = Missing_Count, "Prop. Missing" = Proportion_Missing)
# Print the summary dataframe
datatable(missing_summary_df3,
options = list(pageLength = 3,
class = "hover",
searchHighlight = TRUE),
rownames = FALSE)%>%
formatPercentage("Prop. Missing", 2)
4 Classification Tree
In this section, we perform a classification tree analysis on the dataset. We first load the necessary packages and the dataset. We then prepare the data by encoding the categorical variables and splitting it into training and testing sets. Finally, we prune the tree with different max_depth values to find the tree depth that best balances training and test accuracy.
We first loaded the dataset and identified make as the target variable. We also encoded the categorical variables with label encoding to convert them into numerical values.
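As an illustration of this encoding step, here is a minimal sketch with scikit-learn. The tiny data frame is made up, and the use of LabelEncoder for the target and OrdinalEncoder for a feature column is our assumption about one reasonable way to do it, not necessarily the exact code used for the results in this section.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical miniature of the cleaned data (assumption: values are illustrative)
cars = pd.DataFrame({
    "make": ["Nissan", "Chevrolet", "Nissan", "Alfa Romeo"],
    "drive": ["Front-Wheel Drive", "Rear-Wheel Drive",
              "Front-Wheel Drive", "Rear-Wheel Drive"],
    "model_year": [1985, 1985, 1990, 1992],
})

# Target: one integer code per brand (classes are sorted alphabetically)
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(cars["make"])

# Features: replace the categorical column by integer codes
X = cars.drop(columns="make").copy()
X["drive"] = OrdinalEncoder().fit_transform(X[["drive"]]).ravel()

print(y)
print(target_encoder.inverse_transform(y))
print(X)
```

Keeping the fitted target_encoder around makes it possible to turn predicted codes back into brand names with inverse_transform.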
We then split the dataset into training (80%) and testing (20%) sets so that, after training, we can evaluate the model’s performance on unseen data and check whether the model is overfitting. We will see that it does.
We trained a Decision Tree classifier on the training data without any constraints. The “None” row below corresponds to this unpruned tree. We observed overfitting, with a high accuracy on the training data (89.39%) and a noticeably lower accuracy on the test data (69.84%). We therefore decided to prune the tree, as pruning has the advantage of simplifying the model and thereby limiting overfitting. We pruned the tree by trying several values of the max_depth parameter, which controls the tree’s growth (None, 5, 10, 15, 20, 25, 30). The goal is to find the tree depth that best balances training and test accuracy.
max_depth Training Accuracy Test Accuracy
5 0.2605 0.2550
10 0.4887 0.4677
15 0.7205 0.6349
20 0.8519 0.6938
25 0.8899 0.7000
30 0.8939 0.6992
None 0.8939 0.6984
The model’s accuracy improved as the tree’s depth increased up to a point, with a max_depth of 25 or 30 giving the best test accuracy, at about 70%. Reducing max_depth to 10 or 15 narrows the gap between training and test accuracy, and therefore drastically reduces overfitting, but at the expense of the model’s accuracy on new data. Pruning the tree with a max_depth of 25, however, raises the test accuracy slightly from 69.84% to 70.00% while also slightly reducing the gap between the training set and the test set. In our case, pruning the Decision Tree therefore helps improve its generalization performance by preventing it from becoming too complex and overfitting the training data.
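The depth comparison described in this section can be sketched as follows. The data here is synthetic (generated with make_classification) and only stands in for the encoded car data, so the accuracies it prints will not match the table above; the random_state values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded car data (assumption, not the real dataset)
X, y = make_classification(n_samples=2000, n_features=15, n_informative=8,
                           n_classes=5, random_state=123)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

results = []
for depth in [5, 10, 15, 20, 25, 30, None]:
    # max_depth=None lets the tree grow without any depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    results.append((depth,
                    tree.score(X_train, y_train),
                    tree.score(X_test, y_test)))

# Depth with the highest test accuracy
best_depth, _, best_test_acc = max(results, key=lambda r: r[2])

print("max_depth  train   test")
for depth, train_acc, test_acc in results:
    print(f"{str(depth):>9}  {train_acc:.4f}  {test_acc:.4f}")
print(f"Best max_depth on the test set: {best_depth}")
```

Note that max_depth stops the tree from growing rather than cutting branches afterwards. Selecting the depth on the test set, as done here for simplicity, makes the reported test accuracy slightly optimistic; a separate validation set or cross-validation would avoid this.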
5 Neural Network
import pandas as pd
import numpy as np
from pyprojroot.here import here
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.utils import to_categorical
# Load the data
data = pd.read_csv(here("data/data_cleaned.csv"))
# Display the structure of the data
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42240 entries, 0 to 42239
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 make 42240 non-null object
1 model_year 42240 non-null int64
2 vehicle_class 42240 non-null object
3 drive 42240 non-null object
4 engine_cylinders 42237 non-null object
5 engine_displacement 42238 non-null object
6 transmission 42240 non-null object
7 fuel_type_1 42240 non-null object
8 city_mpg_fuel_type_1 42240 non-null int64
9 highway_mpg_fuel_type_1 42240 non-null int64
10 fuel_type_2 42240 non-null object
11 city_mpg_fuel_type_2 42240 non-null int64
12 highway_mpg_fuel_type_2 42240 non-null int64
13 range_ev_city_fuel_type_1 42240 non-null int64
14 range_ev_highway_fuel_type_1 42240 non-null float64
15 range_ev_city_fuel_type_2 42240 non-null int64
16 range_ev_highway_fuel_type_2 42240 non-null float64
17 charge_time_240v 42240 non-null float64
dtypes: float64(3), int64(7), object(8)
memory usage: 5.8+ MB
None
# Display the first few rows of the data
print(data.head())
make model_year ... range_ev_highway_fuel_type_2 charge_time_240v
0 Alfa Romeo 1985 ... 0.0 0.0
1 Chevrolet 1985 ... 0.0 0.0
2 Chevrolet 1985 ... 0.0 0.0
3 Nissan 1985 ... 0.0 0.0
4 Nissan 1985 ... 0.0 0.0
[5 rows x 18 columns]
# Identify categorical and numerical columns
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# Remove the target column 'make' from the features list
if 'make' in categorical_cols:
categorical_cols.remove('make')
if 'make' in numerical_cols:
numerical_cols.remove('make')
print(f"Categorical columns: {categorical_cols}")
Categorical columns: ['vehicle_class', 'drive', 'engine_cylinders', 'engine_displacement', 'transmission', 'fuel_type_1', 'fuel_type_2']
print(f"Numerical columns: {numerical_cols}")
Numerical columns: ['model_year', 'city_mpg_fuel_type_1', 'highway_mpg_fuel_type_1', 'city_mpg_fuel_type_2', 'highway_mpg_fuel_type_2', 'range_ev_city_fuel_type_1', 'range_ev_highway_fuel_type_1', 'range_ev_city_fuel_type_2', 'range_ev_highway_fuel_type_2', 'charge_time_240v']
# Define the preprocessing steps for numerical and categorical columns
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numerical_cols),
('cat', OneHotEncoder(sparse_output=False), categorical_cols) # Set sparse_output to False
])
# Split data into features and target
X = data.drop('make', axis=1)
y = data['make']
# Apply preprocessing and split data into training and testing sets
X_preprocessed = preprocessor.fit_transform(X)
# Encode the target variable
y_encoded = pd.get_dummies(y).values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y_encoded, test_size=0.2, random_state=123)
# Define the neural network model
model = Sequential([
Input(shape=(X_train.shape[1],)),
Dense(128, activation='relu'),
Dropout(0.2),
Dense(64, activation='relu'),
Dropout(0.2),
Dense(y_train.shape[1], activation='softmax')
])
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
Epoch 1/10
845/845 - 1s 782us/step - accuracy: 0.2074 - loss: 3.1751 - val_accuracy: 0.4678 - val_loss: 1.8306
Epoch 2/10
845/845 - 1s 727us/step - accuracy: 0.4579 - loss: 1.8302 - val_accuracy: 0.5328 - val_loss: 1.4940
Epoch 3/10
845/845 - 1s 726us/step - accuracy: 0.5103 - loss: 1.5813 - val_accuracy: 0.5724 - val_loss: 1.3251
Epoch 4/10
845/845 - 1s 698us/step - accuracy: 0.5382 - loss: 1.4315 - val_accuracy: 0.5912 - val_loss: 1.2380
Epoch 5/10
845/845 - 1s 685us/step - accuracy: 0.5596 - loss: 1.3428 - val_accuracy: 0.6072 - val_loss: 1.1585
Epoch 6/10
845/845 - 1s 672us/step - accuracy: 0.5761 - loss: 1.2784 - val_accuracy: 0.6158 - val_loss: 1.1046
Epoch 7/10
845/845 - 1s 677us/step - accuracy: 0.5809 - loss: 1.2288 - val_accuracy: 0.6260 - val_loss: 1.0641
Epoch 8/10
845/845 - 1s 688us/step - accuracy: 0.5977 - loss: 1.1747 - val_accuracy: 0.6353 - val_loss: 1.0281
Epoch 9/10
845/845 - 1s 689us/step - accuracy: 0.6082 - loss: 1.1333 - val_accuracy: 0.6352 - val_loss: 0.9959
Epoch 10/10
845/845 - 1s 685us/step - accuracy: 0.6129 - loss: 1.1004 - val_accuracy: 0.6491 - val_loss: 0.9588
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
264/264 - 0s 346us/step - accuracy: 0.6396 - loss: 0.9834
print(f'Test accuracy: {accuracy}')
Test accuracy: 0.6441761255264282
# Make predictions
predictions = np.argmax(model.predict(X_test), axis=1)
264/264 - 0s 272us/step
# Print predictions
print(predictions)
[114 19 74 ... 36 74 36]
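The printed predictions are integer class indices rather than brand names. Since the target was one-hot encoded with pd.get_dummies, the column order of that encoding defines what each index means. A minimal sketch of the mapping back to brands, using a made-up target column and made-up softmax outputs instead of the real model:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins (assumption): a tiny target column and fake softmax outputs
y = pd.Series(["Nissan", "Chevrolet", "Alfa Romeo", "Nissan"])
y_dummies = pd.get_dummies(y)

# get_dummies orders its columns alphabetically; index i means class_names[i]
class_names = y_dummies.columns.to_numpy()

probabilities = np.array([
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])

# Most probable class per row, then its brand name
predicted_idx = np.argmax(probabilities, axis=1)
predicted_makes = class_names[predicted_idx]

print(predicted_idx)
print(predicted_makes)
```

With the real model, class_names would come from pd.get_dummies(data['make']).columns and probabilities from model.predict(X_test).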
# Plot the accuracy and loss
fig, axs = plt.subplots(2, 1, figsize=(10, 10))
# Plot training & validation accuracy values
axs[0].plot(history.history['accuracy'])
axs[0].plot(history.history['val_accuracy'])
axs[0].set_title('Model accuracy')
axs[0].set_ylabel('Accuracy')
axs[0].set_xlabel('Epoch')
axs[0].legend(['Train', 'Validation'], loc='upper left')
# Plot training & validation loss values
axs[1].plot(history.history['loss'])
axs[1].plot(history.history['val_loss'])
axs[1].set_title('Model loss')
axs[1].set_ylabel('Loss')
axs[1].set_xlabel('Epoch')
axs[1].legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()
source(here::here("scripts","setup.R"))
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:reshape2':
dcast, melt
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following object is masked from 'package:purrr':
transpose
The following objects are masked from 'package:dplyr':
between, first, last
data_cleaned <- fread(here::here("data", "data_cleaned.csv"))
To see the links between the features, we can use a dimension reduction technique such as Principal Component Analysis, which groups the features according to their similarities across instances and combines them into fewer dimensions.
6 Principal Component Analysis
6.1 Biplot
# Assuming your data frame is named data_cleaned
data_prepared <- data_cleaned %>%
mutate(across(where(is.character), as.factor)) %>%
mutate(across(where(is.factor), as.numeric)) %>%
scale() # Standardizes numeric data including converted factors
pca_results <- PCA(data_prepared, graph = FALSE)
summary(pca_results)
Call:
PCA(X = data_prepared, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
Variance 4.844 3.900 2.106 1.222 0.993 0.866 0.855
% of var. 26.913 21.666 11.700 6.790 5.519 4.811 4.750
Cumulative % of var. 26.913 48.579 60.279 67.069 72.588 77.399 82.149
Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
Variance 0.827 0.725 0.539 0.460 0.309 0.179 0.131
% of var. 4.595 4.028 2.996 2.557 1.718 0.992 0.729
Cumulative % of var. 86.744 90.773 93.769 96.326 98.044 99.036 99.765
Dim.15 Dim.16 Dim.17 Dim.18
Variance 0.034 0.008 0.000 0.000
% of var. 0.188 0.047 0.000 0.000
Cumulative % of var. 99.953 100.000 100.000 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr
1 | 3.335 | -1.044 0.001 0.098 | 0.068 0.000
2 | 3.410 | -1.208 0.001 0.125 | 0.197 0.000
3 | 3.544 | -1.448 0.001 0.167 | 0.268 0.000
4 | 2.789 | -1.245 0.001 0.199 | 0.155 0.000
5 | 2.742 | -1.166 0.001 0.181 | 0.112 0.000
6 | 2.742 | -1.166 0.001 0.181 | 0.112 0.000
7 | 2.855 | -1.129 0.001 0.156 | 0.029 0.000
8 | 2.903 | -1.317 0.001 0.206 | 0.126 0.000
9 | 4.943 | -2.152 0.002 0.190 | 0.605 0.000
10 | 3.448 | -2.115 0.002 0.376 | 0.596 0.000
cos2 Dim.3 ctr cos2
1 0.000 | -0.285 0.000 0.007 |
2 0.003 | 2.415 0.007 0.502 |
3 0.006 | 2.351 0.006 0.440 |
4 0.003 | 0.439 0.000 0.025 |
5 0.002 | 0.407 0.000 0.022 |
6 0.002 | 0.407 0.000 0.022 |
7 0.000 | 0.239 0.000 0.007 |
8 0.002 | 0.285 0.000 0.010 |
9 0.015 | -0.393 0.000 0.006 |
10 0.030 | 1.207 0.002 0.123 |
Variables (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2
make | 0.129 0.345 0.017 | -0.135 0.469 0.018 |
model_year | 0.375 2.900 0.141 | -0.003 0.000 0.000 |
vehicle_class | -0.176 0.643 0.031 | 0.098 0.244 0.010 |
drive | -0.048 0.047 0.002 | 0.013 0.004 0.000 |
engine_cylinders | 0.007 0.001 0.000 | 0.000 0.000 0.000 |
engine_displacement | 0.025 0.013 0.001 | -0.017 0.007 0.000 |
transmission | -0.494 5.028 0.244 | 0.110 0.310 0.012 |
fuel_type_1 | -0.391 3.149 0.153 | 0.247 1.568 0.061 |
city_mpg_fuel_type_1 | 0.868 15.542 0.753 | -0.397 4.047 0.158 |
highway_mpg_fuel_type_1 | 0.838 14.497 0.702 | -0.409 4.284 0.167 |
Dim.3 ctr cos2
make -0.445 9.407 0.198 |
model_year -0.033 0.053 0.001 |
vehicle_class 0.431 8.813 0.186 |
drive 0.148 1.044 0.022 |
engine_cylinders 0.772 28.267 0.595 |
engine_displacement 0.877 36.535 0.769 |
transmission -0.129 0.795 0.017 |
fuel_type_1 -0.282 3.771 0.079 |
city_mpg_fuel_type_1 -0.116 0.634 0.013 |
highway_mpg_fuel_type_1 -0.221 2.326 0.049 |
fviz_pca_biplot(pca_results,
geom.ind = "point", # To show data points
geom.var = c("arrow", "text"), # To show variable vectors and labels
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), # Colors
repel = TRUE # Avoid text overlapping
)
The biplot conveys several pieces of information:
• The colored dots represent the observations of the dataset.
• The cos2 gradient shows how well each observation is represented by the two dimensions: the higher the cos2 (tending to red), the better the representation of the observation.
• The arrows represent the features, in the form of a correlation circle. Here, the first 2 dimensions together account for almost 49% of the variance.
• Looking at the arrows, most of the variables are strongly linked to dimension 2. Arrows that point in opposite directions (such as fuel_type_1 and highway_mpg_fuel_type_1) indicate negatively correlated features. By contrast, fuel_type_1 and fuel_type_2 are uncorrelated.
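To make the cos2 values more concrete: for an observation, the cos2 on one dimension is its squared coordinate on that dimension divided by its squared distance to the origin. The short check below reuses rows 1 and 10 of the "Individuals" table printed above; the helper function name is ours.

```python
# Values copied from the "Individuals" table above (rows 1 and 10, Dim.1)
individuals = {
    1: {"dist": 3.335, "dim1": -1.044, "cos2_reported": 0.098},
    10: {"dist": 3.448, "dim1": -2.115, "cos2_reported": 0.376},
}


def cos2(coordinate, distance):
    """Share of an observation's squared distance captured by one dimension."""
    return coordinate ** 2 / distance ** 2


computed = {row: cos2(v["dim1"], v["dist"]) for row, v in individuals.items()}

for row, value in computed.items():
    reported = individuals[row]["cos2_reported"]
    print(f"Row {row}: computed cos2 = {value:.3f}, reported = {reported:.3f}")
```

A cos2 close to 1 means the observation lies almost entirely along that dimension, while a value close to 0 means the dimension tells us little about it.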
6.2 Screeplot
# Generating the scree plot from PCA results
fviz_eig(pca_results,
addlabels = TRUE, # Adds labels to the plot indicating the percentage of variance
ylim = c(0, 100), # Optional: Sets the limits of the y-axis to make the plot easier to interpret
barfill = "lightblue", # Color of the bars
barcolor = "black", # Color of the borders of bars
main = "Scree Plot of PCA") # Title of the plot
Taking the scree plot into account, 6 dimensions are needed to explain at least 75% of the variance, meaning the features might be relatively independent. This is already visible in the biplot above, where most arrows in the middle are short and the cos2 values are low, meaning that the features might be more strongly linked to other dimensions than to the first 2. To check the correlations further, we can use a heatmap.
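The number of dimensions needed for a given share of variance can be recomputed directly from the eigenvalues printed in the PCA summary above. A small sketch, with the 75% threshold taken from the text:

```python
import numpy as np

# Eigenvalues (variances) copied from the PCA summary above, Dim.1 to Dim.18
eigenvalues = np.array([4.844, 3.900, 2.106, 1.222, 0.993, 0.866,
                        0.855, 0.827, 0.725, 0.539, 0.460, 0.309,
                        0.179, 0.131, 0.034, 0.008, 0.000, 0.000])

# Share of variance per dimension and its running total
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# First dimension at which the cumulative share reaches the threshold
threshold = 0.75
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(f"Cumulative variance after 2 dimensions: {cumulative[1]:.1%}")
print(f"Dimensions needed for {threshold:.0%}: {n_components}")
```

Because the eigenvalues are rounded to three decimals, the recomputed percentages can differ from the summary in the last digit.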
6.3 Heatmap
library(reshape2)
# Assuming data_prepared has been previously defined and standardized
cor_matrix <- cor(data_prepared) # Calculate correlation matrix
# Melt the correlation matrix for ggplot2
melted_cor_matrix <- melt(cor_matrix)
Warning: The melt generic in data.table has been passed a matrix and will
attempt to redirect to the relevant reshape2 method; please note that reshape2
is superseded and is no longer actively developed, and this redirection is now
deprecated. To continue using melt methods from reshape2 while both libraries
are attached, e.g. melt.list, you can prepend the namespace, i.e.
reshape2::melt(cor_matrix). In the next version, this warning will become an
error.
# Heatmap with all correlation coefficients displayed
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") + # Add white lines to distinguish the tiles
geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3.5) + # Always display labels
scale_fill_gradient2(low = "lightblue", high = "darkblue", mid = "blue", midpoint = 0, limit = c(-1,1),
name = "Pearson\nCorrelation") + # cor() computes Pearson correlations by default
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5), # Center the title
plot.title.position = "plot") +
labs(x = 'Variables', y = 'Variables',
title = 'Correlations Heatmap of Variables') # Adjust the title and labels as needed
This heatmap shows the correlations between the variables. Most correlations are not very strong, except for a few pairs such as highway_mpg_fuel and city_mpg_fuel.
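Rather than reading the strongest pairs off the heatmap, they can also be listed programmatically. The sketch below uses synthetic data in which highway_mpg is deliberately built to follow city_mpg; the column names, the 0.8 cut-off and the data itself are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): highway_mpg is constructed to track city_mpg
rng = np.random.default_rng(123)
city = rng.normal(25, 5, size=500)
df = pd.DataFrame({
    "city_mpg": city,
    "highway_mpg": city * 1.3 + rng.normal(0, 1, size=500),
    "model_year": rng.integers(1985, 2024, size=500),
})

# Pearson correlation matrix
corr = df.corr(method="pearson")

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the cut-off
strong_pairs = pairs[pairs.abs() > 0.8]
print(strong_pairs)
```

Passing method="spearman" to corr would give rank-based correlations instead, which may be more suitable for the label-encoded categorical columns.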